The data set is quite well structured and its overall quality is good. However, it contains a few problematic entries; this section removes those entries and modifies the data set so that it can be used easily for further analysis.
| Job Title | Location |
|---|---|
| Genomic Data Scientist | Stevenage, United Kingdom |
| Scientist, Data, Methods and Analytics Immuno-inflammation and Specialty Medicines | Stevenage, United Kingdom |
| Scientist in Data, Methods, & Analytics | Brentford, United Kingdom |
| Lead Data Analyst | Brentford, United Kingdom |
Every company name had its rating appended, even though the rating was already provided in a separate column. The ratings were therefore removed from the company names.
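A minimal sketch of this step, assuming the rating is appended as whitespace (for example a newline) followed by a one-decimal number at the end of the name. The exact format is not shown in this report, and the vector `company` and its values are hypothetical.

```r
# Hypothetical sample values; the real column is not shown in this report.
company <- c("Acme Analytics\n3.9", "Globex\n4.1", "Initech")

# Remove optional whitespace followed by a rating such as "3.9"
# when it sits at the very end of the string.
company_clean <- sub("[[:space:]]*[0-9]\\.[0-9]$", "", company)
company_clean
```

Names without a trailing rating are left unchanged, so the same call can be applied to the whole column.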
The Size (number of employees) column was of type character, so it was converted to a factor and the levels were set accordingly.
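One way this conversion could look is sketched below. The level labels and the vector `size` are assumptions made for illustration; the actual labels in the data may differ.

```r
# Hypothetical values and level labels, ordered from smallest to largest.
size <- c("51 to 200 employees", "1 to 50 employees",
          "10000+ employees", "51 to 200 employees")
size_levels <- c("1 to 50 employees", "51 to 200 employees",
                 "201 to 500 employees", "501 to 1000 employees",
                 "1001 to 5000 employees", "5001 to 10000 employees",
                 "10000+ employees")

# An ordered factor keeps the size categories in a meaningful order
# in tables and on plot axes.
size_f <- factor(size, levels = size_levels, ordered = TRUE)
table(size_f)
```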
All Revenue values were given in USD, so the USD label was removed from the values and added to the column name instead.
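A small sketch of this step, assuming the label appears as the literal text `(USD)` inside each value. The data frame `jobs`, its values and the new column name `Revenue_USD` are hypothetical.

```r
# Hypothetical sample data; the "(USD)" suffix format is an assumption.
jobs <- data.frame(
  Revenue = c("$100 to $500 million (USD)", "$1 to $2 billion (USD)",
              "Unknown / Non-Applicable"),
  stringsAsFactors = FALSE
)

# Drop the literal "(USD)" label and any whitespace left behind.
jobs$Revenue <- trimws(sub("(USD)", "", jobs$Revenue, fixed = TRUE))

# Carry the unit in the column name instead.
names(jobs)[names(jobs) == "Revenue"] <- "Revenue_USD"
jobs
```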
The salary estimate column was very messy: it contained many different ranges, some of them overlapping. It also mixed different estimate types (Glassdoor estimate, employer estimate and per-hour estimate), which had to be separated from the estimate value to make the data easy to use. The estimate ranges were then reconstructed so that the number of distinct ranges was minimised.
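The separation of value and type could be sketched as follows. The string formats in `salary` are assumptions, since the raw values are not shown here, and all object names are hypothetical.

```r
# Hypothetical raw values; the real string format may differ.
salary <- c("$37K-$66K (Glassdoor est.)",
            "$50K-$90K (Employer est.)",
            "$17-$24 Per Hour (Glassdoor est.)")

# Extract every run of digits: the first is the lower bound,
# the second is the upper bound of the range.
nums <- regmatches(salary, gregexpr("[0-9]+", salary))
salary_min <- as.numeric(vapply(nums, `[`, character(1), 1))
salary_max <- as.numeric(vapply(nums, `[`, character(1), 2))

# Keep the estimate type and the per-hour flag in separate columns.
per_hour <- grepl("Per Hour", salary, fixed = TRUE)
estimate_type <- ifelse(grepl("Employer", salary, fixed = TRUE),
                        "Employer", "Glassdoor")

data.frame(salary_min, salary_max, per_hour, estimate_type)
```

Note that in this sketch the bounds of per-hour listings are on a different scale from the annual ranges, so the `per_hour` flag is needed before the two are compared.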
All the -1 values were converted to NA.
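A minimal base R sketch of this replacement, assuming -1 marks a missing value in both numeric and character columns. The data frame `jobs` and its columns are hypothetical.

```r
# Hypothetical sample data in which -1 (or "-1") marks a missing value.
jobs <- data.frame(
  Rating   = c(3.9, -1, 4.2),
  Founded  = c(1999, 2005, -1),
  Industry = c("IT Services", "-1", "Banking"),
  stringsAsFactors = FALSE
)

# Replace -1 with NA column by column; which() skips entries
# that are already NA, and each column keeps its type.
jobs[] <- lapply(jobs, function(col) {
  col[which(col == -1)] <- NA
  col
})
jobs
```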
There are 21 job listings that state the salary (interval) on a per-hour basis.
Figure 1.1: Maximum and Minimum Salary comparison
Data Scientists have both the highest maximum salary limit and the lowest minimum salary limit, which shows how diverse the Data Scientist job classification can be.
Figure 1.2: Location of Business Analyst job by state
In the USA, Business Analyst jobs are most common in Texas and California. The count is considerably lower in New York, which is an interesting observation.
Figure 1.3: Location of Data Analyst job by state
There are considerably fewer Data Analyst jobs than Business Analyst jobs. Data Analyst jobs are most common in Texas, California and New York.
Figure 1.4: Location of Data Scientist job by state
The number of Data Scientist jobs is higher than the number of Business Analyst and Data Analyst jobs. This was also evident from the bar graph above.
Figure 1.5: Ratio of different company sizes for Business Analysts
Figure 1.6: Ratio of different company sizes for Data Analysts
Figure 1.7: Ratio of different company sizes for Data Scientists
The number of startups (companies with a lower employee count) is higher in the Business Analyst field than in the other two, while Data Scientists have more opportunities in larger companies.
Figure 1.8: Business Analyst in various Industries
Figure 1.9: Data Analyst in various Industries
Figure 1.10: Data Scientist in various Industries
Staff Outsourcing and IT services are the major industries where these 3 job classifications are predominant.
Figure 1.11: Data Scientist in various Sectors
Figure 1.12: Data Analyst in various Sectors
Figure 1.13: Business Analyst in various Sectors
Information Technology and Business Services are the predominant sectors in which these job classifications are required.
Figure 1.14: Maximum Salary vs Rating
This claim appears to be true based on the graph above: the ratings tend to be higher as the salary gets higher.
Figure 1.15: Salary vs State
The salary ranges in California, Texas and New York are higher than in the other states.
Figure 1.16: Sector vs State
The sector count is higher in Texas and California than in the other states. This may also be because these two states have more listings.
There are 14 teams in the competition.
There are a total of 370 players.
There are a total of 7 rounds in the competition.
As can be observed, the Kangaroos and Fremantle are the highest goal-scoring teams. They are therefore more likely to win the 2020 season.
The dataset contains 68 variables, of which 34 are numeric. Since a pairs plot shows the distribution of each single variable and the relationship between each pair of variables, the total number of plots that can be made is 34 * 34 = 1156. However, the variable jumper id has been duplicated thrice, which makes it 31 * 31 = 961. The total would be 528, which comprises the diagonals (433) and the upper and lower triangles.
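The panel arithmetic for a scatterplot matrix can be checked with a small helper. This is a sketch: `count_panels` is a hypothetical name, and the value of `p` to use depends on how many numeric variables remain after duplicates are dropped. For `p` variables there are `p * p` panels in total, `p` of them on the diagonal, and `choose(p, 2)` unique pairs in each triangle.

```r
# Count the panels of a p-by-p scatterplot matrix.
count_panels <- function(p) {
  c(total        = p * p,             # every cell of the matrix
    diagonal     = p,                 # one univariate panel per variable
    one_triangle = choose(p, 2),      # unique pairs of variables
    distinct     = p + choose(p, 2))  # diagonal plus one triangle
}

count_panels(34)
count_panels(31)
```

Because the upper and lower triangles show the same pairs with the axes swapped, `distinct` is the number of panels that carry different information.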
The scagnostics striated and stringy were used to find the L-shaped plots, since striated checks the straightness of the points and stringy checks the dispersion. This yielded the variables hitouts and bounces.
The data seemed to have a barrier: the values do not go beyond a certain (x, y) value.
# Shiny
library(shiny)
library(plotly)

# aflw_scags is assumed to have been computed earlier in the analysis.
ui <- fluidPage(
  plotlyOutput("parcoords"),
  verbatimTextOutput("data"))

server <- function(input, output, session) {
  # Scagnostic columns shown on the parallel coordinate plot
  aflw_num <- aflw_scags[, 3:15]

  output$parcoords <- renderPlotly({
    # One dimension per scagnostic, all drawn on a common 0-1 axis
    dims <- Map(function(x, y) {
      list(values = x,
           range = c(0, 1),
           label = y)
    }, aflw_num,
    names(aflw_num),
    USE.NAMES = FALSE)
    plot_ly(type = 'parcoords',
            dimensions = dims,
            source = "pcoords") %>%
      layout(margin = list(r = 30)) %>%
      event_register("plotly_restyle")
  })

  # Brushed ranges, stored per dimension name
  ranges <- reactiveValues()
  observeEvent(event_data("plotly_restyle",
                          source = "pcoords"),
  {
    d <- event_data("plotly_restyle",
                    source = "pcoords")
    # The event name carries the zero-based index of the brushed dimension
    dimension <- as.numeric(stringr::str_extract(names(d[[1]]), "[0-9]+"))
    if (!length(dimension)) return()
    # Look the name up in the same columns that were used to build the plot
    dimension_name <- names(aflw_num)[[dimension + 1]]
    info <- d[[1]][[1]]
    # Several brushes on one axis arrive as a 3-d array, a single brush as a vector
    ranges[[dimension_name]] <- if (length(dim(info)) == 3) {
      lapply(seq_len(dim(info)[2]), function(i) info[, i, ])
    } else {
      list(as.numeric(info))
    }
  })

  # Rows that fall inside at least one brush on every brushed dimension
  aflw_selected <- reactive({
    keep <- TRUE
    for (i in names(ranges)) {
      range_ <- ranges[[i]]
      keep_var <- FALSE
      for (j in seq_along(range_)) {
        rng <- range_[[j]]
        keep_var <- keep_var | dplyr::between(aflw_scags[[i]],
                                              min(rng), max(rng))
      }
      keep <- keep & keep_var
    }
    aflw_scags[keep, ]
  })

  output$data <- renderPrint({
    tibble::as_tibble(aflw_selected())
  })
}

shinyApp(ui, server)
Clumpy and Convex have relatively low values compared to the rest. There seem to be outliers in the convex, skinny and clumpy data. Sparse and Skewed show clumpiness, while the others are more spread out.
Outlying: 0.0 - 0.2; Stringy: 0.6; Striated: 0.2 - 0.8; Skewed: 0.7; Skinny: 0.4; Splines: 0.5
Outlying: > 0.4; Stringy, Striated: > 0.8; Splines: 0
Clumpy and Convex
The pageviews of both the control and experimental data are similar in value.
The number of clicks is also similar in both data sets.
Again, there is not much difference between the two data sets. However, the number of enrollments in November is considerably lower than in October.
The number of payments is also similar in the two data sets. Payments were higher in October.
The flow gets significantly reduced when moving from one part to another.
##
## Two Sample t-test
##
## data: chol_red by Margarine
## t = -2.5186, df = 16, p-value = 0.0228
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.29671838 -0.02550384
## sample estimates:
## mean in group A mean in group B
## 0.4855556 0.6466667
##
## Wilcoxon rank sum test with continuity correction
##
## data: chol_red by Margarine
## W = 16, p-value = 0.03388
## alternative hypothesis: true location shift is not equal to 0
## [1] -0.1611111
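Output of this shape can be produced with calls along the following lines. This is a sketch on simulated data, not the original data: the names `chol_red` and `Margarine` are taken from the output above, while the data frame `marg`, the group sizes of 9 each and the simulated values are assumptions. The header "Two Sample t-test" with df = 16 indicates a pooled-variance test on 18 observations, hence `var.equal = TRUE`.

```r
# Simulated stand-in for the margarine data (assumption: 9 observations per group).
set.seed(1)
marg <- data.frame(
  Margarine = rep(c("A", "B"), each = 9),
  chol_red  = c(rnorm(9, mean = 0.49, sd = 0.15),
                rnorm(9, mean = 0.65, sd = 0.15))
)

# Pooled-variance two-sample t-test (df = n1 + n2 - 2).
tt <- t.test(chol_red ~ Margarine, data = marg, var.equal = TRUE)

# Rank-based alternative; exact = FALSE requests the normal
# approximation with continuity correction.
wt <- wilcox.test(chol_red ~ Margarine, data = marg, exact = FALSE)

# Difference in sample means, group A minus group B.
diff_means <- unname(tt$estimate[1] - tt$estimate[2])

tt
wt
diff_means
```

With the sample means reported above, the same subtraction gives 0.4855556 - 0.6466667 = -0.1611111, which matches the last line of the output.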